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Abstract 

Nuclear Magnetic Resonance (NMR) spectra arc widely used in mctabolomics to 
obtain profiles of metabolites dissolved in biofluids such as cell supernatants. Methods 
for estimating metabolite concentrations from these spectra are presently confined to 
manual peak fitting and to binning procedures for integrating resonance peaks. Exten- 
sive information on the patterns of spectral resonance generated by human metabo- 
lites is now available in online databases. By incorporating this information into a 
Bayesian model we can deconvolve resonance peaks from a spectrum and obtain ex- 
plicit concentration estimates for the corresponding metabolites. Spectral resonances 
that cannot be deconvolved in this way may also be of scientific interest so we model 
them jointly using wavelets. 

We describe a Markov chain Monte Carlo algorithm which allows us to sample from 
the joint posterior distribution of the model parameters, using specifically designed 
block updates to improve mixing. The strong prior on resonance patterns allows the 
algorithm to identify peaks corresponding to particular metabolites automatically, 
eliminating the need for manual peak assignment. 

We assess our method for peak alignment and concentration estimation. Except 
in cases when the target resonance signal is very weak, alignment is unbiased and 
precise. We compare the Bayesian concentration estimates to those obtained from 
a conventional numerical integration method and find that our point estimates have 
sixfold lower mean squared error. 

Finally, we apply our method to a spectral dataset taken from an investigation 
of the metabolic response of yeast to recombinant protein expression. We estimate 
the concentrations of 26 metabolites and compare to manual quantification by five 
expert spcctroscopists. We discuss the reason for discrepancies and the robustness of 
our methods concentration estimates. 



Keywords: metabolomics, concentration estimation, prior information, multi com- 
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50 1. INTRODUCTION 

51 Metabolomics (also known as metabonomics or sometimes metabolic profiling) is a 

52 scientific discipline concerned with the quantitative study of metabolites, the small 

53 molecules that participate in metabolic reactions. Research in this field is expanding 

54 rapidly, with applications in many areas of biology and medicine including cancer 

55 (e.g. Griffiths et al. (2002)), toxicology (e.g. Lindon et al. (2003)), organism clas- 

56 sification (e.g. Bundy et al. (2002)), genetics (e.g. lUig et al. (2010)), biochemistry 

57 (e.g. Raamsdonk et al. (2001)), epidemiology (e.g. Holmes et al. (2008)) and disease 

58 diagnostics (e.g. Brindle et al. (2002)). 

59 Almost all experiments in metabolomics rely on measurements of the abundances 

60 of metabolites in complex biological mixtures, often biofluid or tissue samples. One 

61 of the most extensively used techniques for obtaining such quantitative information 

62 is proton nuclear magnetic resonance (^H NMR) spectroscopy. Metabolites generate 

63 characteristic resonance signatures in NMR spectra and each signature appears 

64 with intensity proportional to the concentration of the corresponding metabolite in 

65 the biological mixture. 

66 Specialized models and tools are needed to draw inferences from NMR spec- 

67 troscopic datasets, which are large and heavily structured. At present there is no 

68 statistical method for analyzing metabolomic NMR spectra reflecting the data gen- 

69 crating mechanisms and the extensive prior knowledge available, e.g. on the form of 

70 metabohte NMR signatures. In this paper, we describe novel Bayesian approaches for 

71 the analysis of NMR data from complex biological mixtures. Specifically, we de- 

72 velop new models reflecting the data generation mechanisms and our prior knowledge, 

73 using a combination of parametric functions and wavelets. We introduce a computa- 

74 tional strategy based on Markov chain Monte Carlo (MCMC), including novel block 

75 updates to overcome strong posterior correlation between the parametric functions 
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76 and the wavelets. We demonstrate the utihty of the approach with simulations and 

77 analyses of data from a yeast metabolomics experiment. 

78 1.1 NMR spectroscopy 

79 An NMR spectrum consists of a series of measurements of resonance intensity usu- 
Bo ally taken on a grid of equally spaced frequencies. Figure [1] is a representative sec- 
Bi tion (of about one tenth) of an NMR spectrum taken from an experiment into the 
B2 metabolomic response of yeast to recombinant protein expression. The x-axis of the 
B3 spectrum corresponds to resonant frequency and the y-axis to resonance intensity. 

B4 [Figure 1 about here.] 

B5 The spectrum is a collection of convolved peaks with different horizontal positions 

B6 and vertical scalings, each of which has the form of a Lorentzian curve. The zero 

B7 centered, standardized Lorentzian function takes the form 
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BB This is the pdf of a Cauchy distribution with scale parameter 7/2; in spectroscopy 7 

89 is called the peak-width at half-height (or sometimes the linewidth) . 

90 Each spectral peak corresponds to magnetic nuclei resonating at a particular fre- 

91 quency in the biological mixture. This frequency determines the displacement of the 

92 peak on the x-axis, which is known as its chemical shift and is measured in parts 

93 per million (ppm) of the resonant frequency of a standard peak. It is conventional 

94 in NMR spectroscopy to use 6 to denote chemical shift and for the 5-axis to increase 

95 from right to left. NMR only detects the resonance of hydrogen nuclei and a 

96 typical NMR spectrum has a range of about Oppm-lOppm. 

97 The resonant frequencies of a magnetic nucleus are largely determined by its 

98 molecular environment, that is the chemical structure of the molecule in which it is 



99 embedded and the configuration of its cliemical bonding witliin tlie molecule. Con- 

100 sequently, every metabolite has a characteristic molecular NMR signature, a con- 

101 volution of Lorentzian peaks that appear in specific positions in NMR spectra. 

102 These are the peaks observed in the NMR spectrum of a pure solution of the 

103 metabolite. The peaks of a signature can have quite different chemical shifts (when 

104 they are generated by protons with different bonding configurations) and so appear 

105 widely separated in a spectrum. 

106 Depending on its molecular environment, a proton may have more than one res- 

107 onant frequency and when this happens the frequencies are usually very similar. 

108 Consequently, the peaks generated appear in a spectrum as a juxtaposition called a 

109 multiplet. The shape of a multiplet (number of peaks, their separations and relative 
no heights) can be used to identify the corresponding metabolite. Figure [2] shows an 

111 NMR spectrum (top panel) and the resonance signatures of the four metabolites 

112 contributing the principal resonance signals (lower panels), with characteristic peak 

113 locations and multiplet shapes. 

114 [Figure 2 about here.] 

115 The intensity of a nuclear magnetic signal is proportional to the number of mag- 
na netically equivalent nuclei generating the resonance in the biological mixture. Conse- 

117 quently, every resonance peak (and therefore also every metabolite resonance signa- 

118 ture) scales vertically in a spectrum in proportion to the molecular abundance of the 

119 corresponding compound in the mixture. 

120 1.2 Specific challenges of NMR in metabolomics 

121 Biofiuids and tissue samples usually contain thousands of metabolites. However, 

122 NMR is relatively insensitive, so that ordinarily a spectrum contains quantitative in- 

123 formation on just a few hundred of the most abundant compounds. These compounds 
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124 can generate hundreds of resonance peaks in a spectrum, many of which overlap. 

125 To quantify a collection of metabolites using NMR, at least one resonance peak 

126 generated by each compound must be identified in the spectrum and deconvolved. 

127 (To reduce uncertainty, it is desirable to identify as many peaks as possible for each 

128 compound.) Estimates of the relative concentrations of the metabolites in the bio- 

129 logical sample can be made by comparing the areas under the deconvolved resonance 

130 peaks. (Estimates of absolute concentration require a reference compound). 

131 The peak identification step {assignment) is complicated by fluctuations in peak 

132 positions between spectra, caused by uncontrollable differences in experimental con- 

133 ditions and differences in the chemical properties of the biological samples, such as 

134 the pH and ionic strength. When this positional noise combines with peak overlap, 

135 assignment can become very hard indeed. 

136 [Figure 3 about here.] 

137 The left panel of Figure [3] illustrates the problem. Excerpts from two spectra corre- 

138 sponding to biological replicates from the same experiment are overlaid, focusing on a 

139 doublet type multiplet with two peaks. The difference in peak position between repli- 

140 cates is obvious to the eye. Here, the magnitude of the positional noise is insufficient to 

141 confuse assignment by an expert spectroscopist but it will pose problems for standard 

142 automated approaches. However, expert deconvolution is rarely practical because it 

143 is labor intensive and relies on someone familiar with metabolite resonance patterns. 

144 Targeted profiling (Weljie et al. 2006) against a standard library of metabolite reso- 

145 nance peaks reduces the importance of expert knowledge but is slow because there is 

146 no automated fitting procedure. Spectral binning (Holmes et al. (1994), Spraul et al. 

147 (1994)) approaches divide the spectrum into regions (bins), within which the inten- 

148 sity measurements are averaged, in an attempt to isolate distinct resonance signals. 
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149 Although this mitigates the effect of peaks fluctuating position within bins, fluctu- 

150 ations across bin boundaries will cause anti-correlated increase/decrease of average 

151 intensity in adjacent bins, even if there is no associated change in metabolite concen- 

152 tration. Spectral binning balances parsimony and computational efficiency, retaining 

153 quantitative information but in a representation with many fewer, easy to compute, 

154 variables. However, the reduced variables are often analyzed without explicit quantifi- 

155 cation of individual metabolites using pattern recognition methods such as principal 

156 components analysis and partial least squares (Lindon, Holmes and Nicholson 2001). 

157 The additional complication of positional noise combined with peak overlap is 
15B illustrated in the right panel of Figure [3l The well defined resonance peaks overlap 

159 with broad signals attributable to a combination of closely overlapping low metabolite 

160 signals and/or macromolecular signatures. This introduces the problem of estimating 

161 the proportion of the signal associated with the sharp resonances and the propor- 

162 tion due to the broad component. The problem becomes even more complex when 

163 the target peaks also overlap with other sharp metabolite signals which additionally 

164 fluctuate between different spectra. 

165 Currently, there is no statistical methodology that can simultaneously address the 

166 problems of identification, deconvolution and quantification when there is positional 

167 noise and peak overlap. We believe a method based on explicit quantification from 

168 deconvolution of metabolite signatures should have significant advantages: spectral 

169 convolution models are parsimonious because they correspond to the physical process 

170 generating the data; the variables inferred are interpretable because they represent 

171 concentrations of identified metabolites; these concentrations are of direct scientific 

172 interest because they depend on the underlying biology. 
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173 1.3 Contributions of this paper 

174 To tackle the problem of quantifying metabolites in complex biofluids such as cell su- 

175 pernatants or urine, we present a Bayesian model for NMR spectra and a Markov 

176 chain Monte Carlo (MCMC) algorithm to automate peak assignment and spectral 

177 deconvolution. Bayesian models for NMR data have been described before, notably 
173 in Bretthorst (1990a), in Bretthorst (19906), in many subsequent papers by the same 

179 author, in Dou and Hodgson (1996) and in Rubtsov and Griffin (2007). Our modeling 

180 exploits extensive prior information on the resonance signatures of the metabolites, 

181 including the expected horizontal displacements and relative vertical scalings of the 

182 peaks. This novel approach allows us to deconvolve peaks and assign them to specific 

183 metabolites in a unified analysis, which eliminates the need for a manual assign- 

184 ment step. The prior information comes from the physical theory of NMR and from 

185 experimental information. Experimental resonance data on human metabolites are 

186 extensive and are publicly available, for example from the online database of the 

187 Human Metabolome Project, the HMDB (Wishart et al. 2009). 

188 Almost all biofluid and tissue NMR spectra contain peaks for which there is no 

189 prior information in the presently incomplete public metabolite databases. Despite 

190 this, the component of a spectrum that cannot be assigned to known compounds, 

191 may contain metabolomic information that is scientifically useful, e.g. for classifica- 

192 tion of spectra. We therefore propose a two component joint model for a spectrum. 

193 We model the metabolites whose peaks we wish to assign explicitly parametrically, 

194 using information from the online databases, while we model the unassigned spec- 

195 trum semi-parametrically, using wavelets. We choose wavelets because they model 

196 signal continuously but locally. They can account for the local correlation of a spec- 

197 trum caused by the continuity of the underlying physical processes without imposing 
193 unrealistic global modeling constraints. 
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199 The wavelet component of our two component likelihood is extremely flexible 

200 so that, without restriction, it tends to absorb signal that should be modeled by 

201 the parametric component, thus inducing a lack of identifiability. We address this 

202 by penalizing the wavelet coefficients using heavy-tailed scale-mixture priors. These 

203 priors shrink wavelet coefficients wherever the spectral signal can be explained by 

204 the parametric component of the model. We also impose a truncation condition on 

205 the wavelets, which reflects prior knowledge that frequency-domain NMR spectra lie 

206 almost completely in the upper-half of the (x, y) plane. 

207 To overcome the strong posterior correlation between parameters corresponding 

208 to the two model components we introduce purposely designed Metropolis-Hastings 

209 block proposals which update the parameters of the two components jointly. 

210 2. MODELLING 

211 2.1 NMR Spectra 

212 Previous authors (Bretthorst (1990a), Dou and Hodgson (1996), Rubtsov and Griffin 

213 (2007)) developed Bayesian models for NMR data in the time domain, in which 

214 resonance signals appear as exponentially decaying sinusoids. However, we prefer 

215 to model conventionally preprocessed (by apodization, phase and baseline correction) 

216 data in the more interpretable frequency domain, in which resonance signals appear as 

217 peaks (e.g. Figured]). Our model exploits the positivity of the frequency-spectrum, a 

218 condition which cannot be expressed parsimoniously in the time domain. Under an iid 

219 Gaussian model for errors, the two representations contain the same information since 

220 they are related by an orthogonal transformation (the discrete Fourier transform). 

221 A frequency domain NMR dataset is a pair (x, y), where x is a length n vector 

222 of points on the chemical shift axis, usually regularly spaced and y is a vector of 

223 corresponding resonance intensity measurements, n is typically of the order 10'^ to 



7 



224 10^ depending on the resolution of the spectrum, and the size of the region under 

225 consideration. The intensity measurements are noisy, so that, although they measure 

226 inherently positive quantities, some components of y are likely to fall below the 5-axis. 

227 y is usually standardized in some way, for example so that Yli Hi — 1- 

We model y|x assuming the yj|x are independent normal random variables and. 



228 where the component of the model represents signal from metabolites with peaks 

229 we wish to assign explicitly and which have been previously characterized and cat- 

230 aloged in databases. (The metabolites chosen will vary from analysis to analysis 

231 according to the prior belief about the content of the biological mixture and the sci- 

232 entific question.) The ^ component of the model represents signal generated by peaks 

233 we do not wish to assign (this may include signal from uncataloged resonances of 

234 molecules which are partially characterized, with the characterized resonances mod- 

235 eled in the component). We construct parametrically (as a continuous function 

236 of continuous chemical shift 6) using the physical theory of NMR, and we model ^ 

237 semi-parametrically using wavelets. 

238 2.2 Modeling 0, the cataloged metabolite signal 

239 According to physics, the resonance signatures of distinct compounds are indepen- 

240 dent, accumulate with an intensity proportional to molecular abundance and aggre- 

241 gate in a spectrum by convolution. Consequently, we can write as a linear combina- 

242 tion, where each term in the sum corresponds to the signature of one of M different 

243 metabolites. 



E(yi|x) = (j){xi) + ^{xi) 



(2) 



M 



(3) 



m=l 
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244 (The value of M will depend on the scientific problem, it is likely to be of order 10° 

245 to 10^). For each m, tm is a continuous template function which specifies the NMR 

246 signature of metabolite m. The corresponding coefficient /3m is proportional to the 

247 molecular abundance of m (i.e. the concentration of m) in the biological sample. 

248 Physical theory restricts the model space for each template function to a parametric 

249 mixture of horizontally translated and vertically scaled Lorentzian peaks (Hore 1995). 

250 We remarked previously that many metabolite signatures contain clusters of Lorentzian 

251 peaks called multiplets. Isolated peaks are often classed as singlet multiplets and, if 

252 we adopt this convention, we can express each signature template tm completely as a 

253 linear combination of a set of multiplet curves gmu-, 

imiS) = ^ ZmugmuiS - S^J, (4) 
u 

254 where u is an index running over all the multiplets belonging to metabolite m. We 

255 assume gmuiS)dS = (7m„(5)(i5 for all m, m, so that the parameter specifies 

256 the position on the chemical shift axis of the center of mass of the uth multiplet of 

257 the mth metabolite (see the large multiplet in Figure H]). We call the chemical 

258 shift parameter of the multiplet. Each of the coefficients Zmu is a positive quantity, 

259 usually equal to the number of protons in a molecule of m that contribute resonance 

260 signal to the multiplet u. {z^u may sometimes be non-integral, due for example to 

261 relaxation effects (Hore 1995). In these cases Zmu must be interpreted as an 'effective' 

262 proton contribution). j°°^g„iu{,5)d5 is constant over m and u so the area under each 

263 tm is proportional to ^^Zmui the number of protons resonating in a molecule of m. 

264 With a few exceptions, multiplets can be classified into one of a number of com- 

265 mon types (Figure H]) which determine the configuration of the peaks (a doublet, a 

266 triplet, a doublet of doublets, etc). This classification together with a small number of 

267 continuous quantities called J-coupling constants, which determine the (horizontal) 
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268 distances between the peaks, completely parameterize a multiplet curve. 



269 [Figure 4 about here.] 

To be precise, a multiplet curve gmu is the weighted average of Vmu translated 
Lorentzian curves (see eqn. ([1])), 

v=l 

270 where the weights Wmuv (which sum to one over v) determine the relative heights of 

271 the peaks of the multiplet and the translations Cmuv determine the horizontal offsets 

272 of the peaks from the center of mass of the multiplet (see Figure HI). Multiplets are 

273 (usually) symmetric so that {-Cmuv ■ v = l,...,Vmu} = {cmuv ■ v = l,...,Vmu} and 

274 W^miiu' '^muv when Cjjiy^u' ^muv 

275 2.3 Modeling S,-, the uncatalogued metabolite signal 

276 We model (^(xi), ^(a;„))"^ as a linear combination of wavelet basis functions and 

277 use 6 to denote the vector of wavelet coefficients. We chose to use Daubechies's 

278 least asymmetric wavelets with 6 vanishing moments (symlet-6) as a wavelet basis 

279 because these wavelets have a similar shape to Lorentzian peaks. Symlets have been 

280 used previously to select features from NMR spectra (Kim et al. 2008) and sensitivity 

281 analysis comparing other potential wavelet bases showed little difference in spectral 

282 reconstructions. 

283 2.4 The Likelihood 

We now bring together the models for and ^ to make a formal specification of the 
probability model for the data. It is easier to do this in the wavelet domain, because 
the dimension of the wavelet space p often needs to be greater than the dimension 
of the data space n, to deal with distortion at the spectral borders (see Strang and 
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Nguyen (1996) and Section 1 of the supplementary material.) Let W be the wavelet 
transform corresponding to the symlet-6 wavelet basis. The likelihood, is defined by 

Wy = WT/3 + Ip6l + e, e~iV(0,VA), (6) 

284 where T is the n x M matrix with tm{xi) as its {i, m)th entry, Ip is the p x p identity 

285 matrix and where A is a scalar precision parameter. 

286 Equation ([S]) is a linear regression of Wy on the columns of [WT Ip], the matrix 

287 generated by adjoining WT and Ip columnwise. Since this matrix has more columns 

288 than rows, the regression coefficients cannot all be identifiable in the likelihood. We 

289 address this in the next section, by specifying a prior which helps to distinguish the 

290 parametric and semi-parametric components of the model. 

291 2.5 Prior specification 

292 Our aim is to obtain a joint Bayesian posterior distribution over the parameters con- 

293 trolling the shape of the templates {tm '■ fn = 1, M}, and the regression parameters 

294 /3, 6 and A; we now specify priors for these parameters. 

295 Prior for the peak-width: Our focus is on spectra generated by biofiuids such as cell 

296 supernatants or urine, for which peak-widths vary between, but negligibly within 

297 spectra; it is therefore reasonable to assume that peaks within a spectrum depend on 

298 a single common peak- width parameter 7. Our prior for 7 is a log-normal distribution, 

299 with Median(7) = IHz/F, Var(7) = 4.6Hz^/F^ where F is the operating frequency 

300 of the spectrometer in MHz. This prior gives good support to a broad region around 

301 IHz/F, typical of the peak-widths generated by modern spectrometers (Hore 1995). 

302 (With this prior, it is easy to relax the assumption of a common peak-width, since 

303 local deviations at the metabolite, multiplet, or peak level can be modelled using 

304 Gaussian random effects on log(7).) 
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305 Prior for the multiplets: In section l?]2] we described a two-level parameterization of 

306 metabolite signature templates, defined by @ and ([5]), as a linear combination of 

307 Lorentzian peaks nested in multiplets. This allows us to represent a difference in 

308 the uncertainty of peak positions within and between multiplets. The parameters 

309 Cmuv and Wmuv, which determine the multiplet shapes, vary very little across NMR 

310 spectra. We assume they are constant and compute them by applying some simple 

311 rules (see Hore (1995), chap. 3 for the details), from empirical estimates of the J- 

312 coupling constants which are published in online databases. In contrast, as noted in 

313 section 11.21 the multiplet chemical shift parameters do fluctuate slightly between 

314 spectra according to experimental conditions. We use an estimate 6*mu of each 6^^, 

315 taken from online databases, to construct an informative prior which accounts for this 

316 uncertainty. The positional noise is local and smaller fluctuations are more probable 

317 than larger ones, so we assign each a truncated normal prior distribution with 

318 mean parameter S*rnu, variance parameter 10~'^ppm^ and truncation region [6*rnu — 

319 O.OSppm, + O.OSppm]. It may sometimes be appropriate to specify a multiplet 

320 or metabolite specific alternative depending on what is known about the variability 

321 of particular multiplet locations across spectra. 

322 Prior for the metabolite abundances: Having defined a parametric prior for the 

323 metabolite signature templates, we now consider the prior for the vector (3, each 

324 component of which represents the resonance intensity of a signature and is propor- 

325 tional to the abundance of that metabolite in the biological sample. The intensities 

326 are positive so the prior for each component of f3 should confine its support to R+. 

327 For conditional conjugacy, we assign a normal prior to each /J^, truncated below at 

328 zero, /3m ~ TN{em, l/s^, 0, oo). This is flexible enough to encode prior information 

329 for a wide range of research problems. For the simulations and examples presented 

330 in this paper we assume low prior information and choose Cm = and Sm = 10~^ for 
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331 all m. 



332 Prior for the wavelet coefficients and precision parameter: In section [2^ we observed 

333 that the parametric (0) and semi-parametric (^) components of the model are not 

334 identifiable in the likelihood. In order to resolve the model components we penalize the 

335 semi-parametric component by assigning the wavelet coefficients a prior distribution 

336 with heavy tails and a concentration of probability mass near zero. In addition, to 

337 reffect prior-knowledge that NMR spectra are mostly restricted to the half-plane above 

338 the chemical shift axis, our prior penalizes models in which W^^O has components 

339 below a small negative threshold. To be precise, in order to specify a joint prior 

340 for {0, A) we introduce a vector of hyper-parameters ip, each component of which 

341 corresponds to a wavelet and a vector of hyper-parameters r, each component of 

342 which corresponds to a spectral data point. The joint prior for {6, if), r. A) has pdf 

343 proportional to 



jk 



exp 



jk 



r (t — hlr 



l{W-l0>TAhl„>T} 

(7) 

344 The index jk here corresponds to the kth wavelet in the jth wavelet-scaling level. 

345 The following lemma is proved in the supplementary material 

346 Lemma 1 Normalization of ^ defines a joint prior for {0,ip,T,X). This prior is 

347 proper. 

348 The joint prior specified by ([7]) was motivated by consideration of a scale mixture 

349 of multivariate normals with smoothed truncation limits: 
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ipjk ~ Gamma(cj, (ij/2), 



(9) 



TN{h,l/{Xr),-oo,h), 



(10) 



A ~ Gamma(a, 6/2), 



(11) 
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where C^tA is a normalizing constant. The index i here corresponds to the ith 
spectral data point. In this specification, i/? allows the prior precision associated 
with each wavelet to deviate from the global precision A. The gamma hyperprior 
on each component of i/) induces local shrinkage in the marginal prior for 0, which 
encourages posterior sparsity in the wavelet coefficients, r is a vector of n truncation 
limits, which bounds W^^O below. The decaying hyperpriors on the components of 
T smooth these limits and penalize 6 more heavily as more of the semi-parametric 
component component) of the model lies below the line y = h, where h is a small 
negative number, chosen close to zero on the spectral intensity scale. 

The joint distribution specified by (!8l)- (fT0l) is a reasonable representation of prior 
belief about 6 and A; it places a constraint on the conditional distribution of 6 given 

T and A. However, because the normalizing constant Cxip-r of ([H]) has no closed 
form, it is hard to devise a computationally efficient scheme to sample from the 
resulting 'doubly intractable' posterior (Murray et al. 2006). The prior defined by 
([7]) places the constraint on the joint distribution of the parameters, rather than the 
conditional distribution and it is easy to sample from the full conditionals of 6, t/?, 
T and A, if we use this prior. We contend that this is an equally valid specification 
and show in Section 2 of the supplementary material that it behaves similarly to the 
prior defined by fl8|)- f|T0|) . 

Figure [5] demonstrates the effect of penalizing the semi-parametric component of 
the model when it lies below the chemical shift axis. First, without penalizing ^ in the 
lower half plane, if the components of are given untruncated Student's-t priors, the 
posterior for the vector of quantification parameters (3 focuses asymptomatically on a 
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373 region close to the ordinary least squares estimate of the parameters, (T-^T)~^T'^y, 

374 while the wavelet component absorbs most of the residual spectrum. When a metabo- 

375 lite has a multiplet embedded in a region of unassigned spectral resonance the least 

376 squares estimate overestimates the corresponding quantification parameter. The sig- 

377 nature templates absorb spectral signal even when they do not match the shape of the 

378 spectral data and this leads to strong posterior support for negative values for some 

379 components of W^^O. The prior defined by ([7]) however, can correct the concentra- 

380 tion estimate providing at least one of the multiplets of the metabolite is deconvolved 

381 cleanly. 

382 [Figure 5 about here.] 

383 Scale-invariant inference and hyper parameter settings: Note that in the limit a — )■ 

384 0, 6 — )■ 0,Vm, — )■ 0, the (improper) prior is invariant under the scaling reparame- 

385 terisation (/3, 6, A) i— )■ {S(3, S6, X/S^) for every constant 5 > so that, in this limit, 

386 inference is unaffected by the scale of measurement of the data. Smaller a and b 

387 correspond to increased uncertainty in the value of A. For simulations and examples 

388 described in this paper we take a = 10~^ and b = 10~^. 

389 The values of the cj and dj control the degree of shrinkage penalization imposed on 

390 the wavelet coefficients. Experience shows cj = 0.05 and dj = 10~^ provides adequate 

391 penalisation. More stringent penalization is possible but our MCMC algorithm tends 

392 to mix less well as the penalty gets stronger, because our block updates are less good 

393 at targeting the true posterior distribution of 6 (see below), h controls the threshold 

394 below which the wavelets are penalised in the lower half plane while r controls the 

395 strength of that penalisation. We choose h a little below zero to be sure that true 

396 signals below y = 0, due for example to baseline wiggle can be adequately modelled 

397 by wavelets. We set h = —0.002 and r = 10^. 
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398 3. MARKOV CHAIN MONTE CARLO ALGORITHM 

399 We implemented a Markov chain Monte Carlo algorithm, to sample from the joint 

400 posterior distribution of the model parameters. There are three types of MCMC 

401 update. 



402 • Firstly, there are Gibbs samplers for the components of /3 (truncated normal), 

403 the components of (truncated normal), the components of i/? (gamma), the 

404 components of r (truncated normal) and A (gamma). The specific distributions 

405 for these updates are given in the supplementary material. 

406 • Secondly, there is a Metropolis-Hastings update for each of the parameters (ex- 

407 cept the components of /3) controlling the parametric component of the model. 

Specifically, in order to update the multiplet chemical shift parameter we 
propose from 



TN{51^, Vi. 5V« - 0.03ppm, i-^nu + 0.03ppm), (12) 



408 a Gaussian proposal, centred on the current parameter value and truncated at 

409 the boundaries of the prior distributions. We adapt the proposal variance Vi, 

410 using the Adaptive Metropohs-within-Gibbs algorithm of Roberts and Rosen- 

411 thai (2009). Vs*^^ is tuned to target an acceptance rate of 0.44 by increments 

412 and decrements which decay in magnitude asymptotically like the inverse of the 

413 square root of the iteration number. 

414 In order to update the peak- width parameter we make a Gaussian proposal for 

415 log 7, centred on the current parameter value. Again, we adapt the proposal 

416 variance following Roberts and Rosenthal (2009). 

417 The likehhood constrains Wy — WT/3 — d, inducing strong posterior correlation 

418 between the parametric and semi-parametric components of the model. Con- 
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419 sequently, updates of this type, which just propose changes to the parametric 

420 component, can only make local moves in the state space of the Markov chain. 

421 • Thirdly, there are Metropohs-Hastings block updates in each of which a param- 

422 eter controlling the parametric component of the model is updated jointly with 

423 the vector 6 of wavelet coefficients. The joint proposal breaks the posterior 

424 correlation between the parametric and semi-parametric model components, 

425 allowing larger jumps in state-space. 

426 Block updates of the extend the univariate proposals described previously. 

427 First we draw the univariate proposal (although we fix the proposal vari- 

428 ance, see supplementary material), then conditional on the value drawn we 

429 propose a new value 0' for the vector of wavelet coefficients d and perform a 

430 global Metropolis-Hastings accept-reject assessment for the block update. 

431 The conditional proposal for 6' is a multivariate truncated normal distribution 

432 with mean parameter Wy — WT'/3 (where T' is the template matrix updated to 

433 reflect the initial univariate proposal), precision parameter Alp and truncation 

434 W^^O' > T. We can simulate from this distribution by making the change 

435 of basis, T]' = W^^O'; the components of rj' are then independent univariate 

436 truncated normal distributions. This choice of conditional proposal is motivated 

437 by the full conditional of 6, 

0\il^, T, A, y ~ TMVN (Wy - WT/3, (Ip + *)"^ /A, W'^O > r) (13) 

438 which has a similar distribution but with a reduced precision. Unfortunately 

439 there is no easy way to sample from f|T3|) because the truncation condition 

440 W^^O > T induces a complex dependence structure between the components 

441 of 0. 
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442 The only proposals not yet described are those for the block updates of each 

443 component of f3 with 0. We propose from a Cauchy distribution, truncated 

444 below at zero. We center the Cauchy distribution on the point that maximizes 

445 the full conditional of Pm subject to = and to the truncation condition 

446 y — T/3 > T. This is a greedy proposal, in the sense that it attempts to maximise 

447 and explain as much of the spectral signal as possible using the template 

448 for metabolite m, excluding the wavelets {0 = 0). Conditional on the proposed 

449 we propose 0' from a multivariate truncated normal distribution with mean 

450 parameter Wy — WT/3' (where f3' is f3 with the mth component set to 

451 precision parameter Alp and truncation yV^^0' > r. 

452 Although, on average, block updates move the chain further than the correspond- 



453 ing single parameter updates, the acceptance rate is lower. Consequently moves of 

454 the second and third type make complementary contributions to MCMC mixing. 

455 Improving convergence and mixing: During the burn in stage of the MCMC we tem- 

456 per the likelihood and penalize the wavelet component of the model stringently to help 

457 the chain move into a region of good posterior support. A parameter T (see section 

458 2 of supplementary material), which jointly controls the temperature/penalization, 

459 gradually cools according to a deterministic schedule, proportional to the complement 

460 of a Gaussian cdf. 

461 To improve mixing, we implemented a population (multi-chain) version of our 

462 algorithm in which the MCMC operates on a product state-space composed of copies 

463 of the space described in Section |2l The chain targets a tempered version of the 

464 posterior distribution of interest marginally on each subspace, but with each subspace 

465 taking a different value of T. The Metropolis-Hastings updates already described 

466 operate within each subspace; the acceptance probabilities for these depend only on 
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467 the state of the relevant subchain and so they can be carried out in parallel on different 

468 CPU cores. Additionally, there are two types of update which allow the transfer of 

469 information between subspaces. Firstly, we propose that subspaces adjacent in the 

470 ordering of T swap parameter values, as in the exchange moves of parallel tempering 

471 (Geyer 1991). Secondly, for a multiplet location parameter 5*^^ in a given subchain, 

472 we pick a complementary subchain uniformly at random and propose a new value 

473 for 51^^ from a Cauchy distribution, centred on the value of the parameter in the 

474 complementary chain, with scale-parameter 3ppm. We then propose a conditional 

475 update for the 6 of the given subchain in the manner of the block-updates already 

476 described. This proposal is made for every chemical shift parameter of every subchain 

477 once for each iteration of the MCMC algorithm. It allows good values of a chemical 

478 shift parameter to spread through the population of chains. In principle additional 

479 information sharing moves (e.g. Evolutionary Monte Carlo crossover (Liang and 

480 Wong 2000)) could be added, but we find this 'copy' move to be sufficient. Ergodicity 

481 and the marginal convergence of the subchain with T = 1 to a stationary distribution 

482 equal to the target posterior is assured by theory (Jasra, Stephens and Holmes 2007). 

483 We combine annealing with parallel chains by running a ladder in which the ratios of 

484 the temperatures of the subchains are constant over time, and for which the target 

485 subchain (that with T — 1) cools according to the specified annealing schedule. 

486 4. PERFORMANCE 

487 We simulated 100 biofluid NMR spectra by convolving the standard library spec- 

488 tra for pure samples of acetic acid, alanine, betaine, creatine, glycine, glycolic acid, 

489 guanidoacetic acid, lactic acid, succinic acid, taurine, trimethylamine and trimethy- 

490 lamine oxide all of which generate resonances in the region Oppm — 5ppm. The 

491 concentration of each metabolite was generated as a C/[0, 1] random variable and the 

492 multiplets of each metabolite were perturbed by draws from C/[— 0.03ppm, O.OSppm]. 
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493 We mixed each simulated spectrum with a broad Gaussian pdf the mean of which 

494 was drawn uniformly from the range of the spectrum (Oppm — 5ppm) and standard 

495 deviation of which was fixed at 5.8ppm. This hump is typical of the broad baseline 

496 distortions commonly generated by macromolecules in biofluid spectra. 

497 Because each spectrum in the standard library was generated from a different 

498 NMR experiment, the width of the peaks in a simulated spectrum can vary slightly 

499 between resonances generated by different compounds. This variability can affect 

500 inferences of concentration made with our model because of correlation between con- 

501 centration and peak-width parameters. To deal with this artefact of the simulator, 

502 we extended the model by adding random effects to allow for some (small) inter- 

503 metabolite variability in the peak-width. 

504 We applied the (single chain version of the) MCMC algorithm for the extended 

505 model to each simulated spectrum, running for 5 000 iterations with a 3000 iteration 

506 burn in. 

507 4.1 Multiplet Localization 

508 Figure [6] is a normalised histogram of the difference between the Bayesian poste- 

509 rior mean estimate of each multiplet chemical shift and the true simulated chemical 

510 shift, illustrating the quality of the peak localisation. The Bayesian posterior mean 

511 estimates 85% of shifts within 0.002ppm and 90% within O.OlSppm. 

512 [Figure 6 about here.] 

513 4.2 Quantification 

514 We compared our Bayesian method for quantification to the following algorithm for 

515 numerical integration, which is slightly more sophisticated than conventional spectral 

516 binning because it includes a peak identification step. To integrate a multiplet we used 

517 the known simulated perturbation in chemical shift to identify the precise location of 
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518 the multiplet. In practice, this information would be unknown to a spectroscopist, at 

519 best he or she might be able to identify a multiplet by eye and choose bin boundaries 

520 based on the observed location. The information is, of course, unavailable to the 

521 Bayesian method. Using the pure compound spectrum we identified the region [L, R] 

522 of the chemical shift axis that corresponds to the central 95% mass of the multiplet 's 

523 parametric template. Finally, we estimated the total area under the multiplet by 
''2" {J2L<xi<Ryi) I (0.95A^(L — /?)), where is the number of Xi in [L, B\. 

525 The plots in Figure [7] show the actual vs. estimated concentrations for estimates 

526 made by numerical integration and Bayesian posterior mean. The numerical method 

527 is systematically biased towards overestimation for two reasons. Firstly, it fails to 

528 take account of the Gaussian background hump and secondly it cannot distinguish 

529 resonance in the region [L, i?] generated by the metabolite of interest from other con- 

530 founding signals. The Bayesian estimate is unaffected by the first problem because 

531 the background hump is modeled with wavelets; the second problem is dealt with 

532 by deconvolution. The mean quadratic error for the numerical estimator is 0.106 

533 (equivalent to Pearson correlation 0.70), where the mean is taken over all simulation 

534 replicates and metabolites. In comparison, the quadratic error for the Bayesian pos- 

535 terior estimator is 0.017 (equivalent to Pearson correlation 0.89), over sixfold smaller. 

536 [Figure 7 about here.] 

537 To investigate the performance of the Bayesian method under different strengths 

538 of multiplet chemical shift perturbation we ran an additional 150 replicates of the 

539 simulations previously described in 3 blocks of 50 with perturbations drawn from 

540 [/[— O.Olppm, O.Olppm], f/[— 0.02ppm, 0.02ppm] and 0.04ppm, 0.04ppm] in each 

541 block. We extended the truncation on the chemical shift priors to ±0.045ppm the 

542 database estimate. The peak assignments and concentration estimation errors are 
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543 summarized in Table [H There is no evidence of a major trend in the quahty of peak 

544 assignment or concentration estimation with the magnitude of multiplet perturbation. 

545 [Table 1 about here.] 

546 5. YEAST DATASET 

547 To illustrate the performance of our method on a real dataset, we took three spectra 

548 from the experiment investigating the metabolic response of the yeast Pichia pastoris 

549 to recombinant protein expression (Tredwell, Edwards- Jones, Leak and Bundy 2011). 

550 The spectra were generated using biological replicates prepared under the same con- 

551 ditions. Consequently, the metabolic profiles of the samples are extremely similar 

552 and the spectra contain essentially the same metabolite concentration information. 

553 Nevertheless, the spectra are slightly different, because for example, of experiment 

554 level positional noise in the chemical shifts of resonance peaks. By modeling the 

555 three spectra jointly we can quantify metabolites using information from all three 

556 replicates, while accounting for these experiment level differences. 

557 We used the model described in Section |2] as a basis for a joint model of multiple 

558 spectra. In the new model, the vector of metabolite quantification parameters (3 is 

559 held in common across the spectra. All the remaining parameters are copied from 

560 the original model, with a replicate set assigned to each spectrum. The MCMC 

561 algorithm for the multiple-spectra model is very similar to the procedure described 

562 for the original model. The Metropolis-Hastings updates involving components of f3 

563 need to be adjusted, to reflect the dependence on multiple spectra, but are similar 

564 to those for the simpler model (see section 3 of the supplementary material). The 

565 updates for the remaining parameters continue to be valid within each spectrum 

566 because, conditional on (3, the joint posterior factorizes into separate probability 

567 models, each corresponding to a different spectrum. 
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568 Tredwell, Behrends, Geier, Liebeke and Bundy (2011) manually quantified 37 

569 metabolites from these spectra, with each of the five authors assigning the resonances 

570 and estimating concentrations independently. Not all the resonance patterns gener- 

571 ated by the 37 metabolites take the form of the symmetric multiplets described in 

572 Section [2l Multiplet shapes are sometimes distorted by strong interaction effects, 

573 which we cannot easily include in our model because they are neither described by 

574 a known parametric model nor cataloged in a public database. However, distorted 

575 multiplets are still convolutions of Lorentzian peaks, so it is sometimes possible to 

576 construct a template-based model by estimating the weights and translations of ([5]) 

577 from an NMR spectrum of the relevant pure compound generated under similar ex- 

578 perimental conditions. We were able to construct a parametric signature template 

579 for 26 of the 37 compounds by combining public database information with param- 

580 eter estimates from our laboratory library of pure compound spectra. However, we 

581 were unable to construct complete signatures, containing a full complement of mul- 

582 tiplets, for every metabolite. Although this precludes a complete deconvolution of 

583 the spectral signal generated by compounds with unmodeled resonances, and the 

584 omitted resonance signals will be absorbed into the wavelet component of the model. 

585 Nevertheless, our main aim is to obtain accurate concentration estimates and this 

586 is still achievable, providing at least one multiplet from each metabolite deconvolves 

587 correctly. 

588 We ran the MCMC procedure, with 8 parallel chains tempered on a ladder, for 

589 20 000 iterations following a 10 000 iteration burn in. We made an adjustment 

590 to the prior on the chemical shift parameters of the singlet multiplets generated 

591 by Histidinol near to 7.25ppm and 8.19ppm by truncating the prior at ±0.5ppm 

592 of the HMDB estimate rather than at ±0.03ppm. This represents prior knowledge 

593 that the chemical shifts of those multiplets are more variable than is typical because 
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594 of sensitivity to chemical properties of the biofluid. Figure |8] shows the posterior 

595 deconvolution of a heavily congested region (2.6ppm — S.lppm) and a region of broad 

596 (O.Sppm — 1.4ppm) from one of the spectra. 

597 The five spectroscopists deconvolved the spectra with the assistance of the widely 

598 used Chenomx spectrographic software, which implements a form of targeted profil- 

599 ing (Weljie et al. 2006). This is probably the most precise method currently available 

600 for estimating metabolite concentrations from spectra, although its accuracy depends 

601 on spectroscopists being able to make correct peak assignments. Figure [9] is a plot 

602 showing the strong concordance between the concentration estimates of the spectro- 

603 scopists and concentration estimates made by the Bayes posterior mean. Although 

604 1 1 of the 26 posterior mean estimates lie outside the range of the spectroscopists' 

605 estimates, there are only 3 cases of substantial discordance. The statistical estimates 

606 for arginine and histidinol are substantially larger than the spectroscopist's estimates 

607 while that for malic acid is substantially lower. In the cases of arginine and histidinol 

608 it is hard to fault the posterior deconvolution by visual inspection. The discrepancies 

609 could be due to an error in the templates used by the spectroscopists to profile the 

610 resonances or to experimental errors in the database estimates of the parameters used 

611 to construct the metabolite signature templates of Bayesian model. In the case of 

612 malate the discrepancy appears to have been caused by a multiplet misalignment. In 

613 principle this could be resolved by adjusting the prior for that multiplet 's chemical 

614 shift parameter in order to force the correct alignment. 

615 [Figure 8 about here.] 

616 [Figure 9 about here.] 
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617 6. DISCUSSION 

618 Presently, automatic methods for analyzing biofluid NMR data rely on non-parametric 

619 pattern recognition techniques or are based on approximate numerical integration 

620 algorithms, such as spectral binning. These methods ignore a large amount prior 

621 information about the physical process generating the spectral data. 

622 Prior information about a data generating process can easily be incorporated into 

623 a Bayesian analysis through specification of a likelihood and specification of a prior 

624 distribution for the parameters of the likelihood. We have shown that a Bayesian 

625 model for biofluid spectra, which exploits an informative parametric prior for the 

626 patterns of resonance generated by selected metabolites, can be used to deconvolve 

627 those resonances from a spectrum and to obtain explicit concentration estimates for 

628 the metabolites. 

629 Simulations show that our MCMC algorithm usually identifies spectral resonance 

630 peaks precisely. Peak misalignment may occur when a target resonance for a multi- 

631 plet of a template appears in the spectrum close to other stronger signals. The model 

632 may then encourage the template to align incorrectly with the stronger signals, even 

633 if they have the wrong shape. This is because the wavelet coefficients are heavily 

634 penalized in the prior but the parametric templates are not. Even when the model 

635 posterior concentrates around an incorrect deconvolution, the strong prior penaliza- 

636 tion on negative spectral signal means that posterior estimates of concentration can 

637 still be accurate, providing at least one multiplet for each metabolite deconvolves cor- 

638 rectly. Concentration estimation, the main motivation for the modeling, is therefore 

639 quite robust to mis- assignment of spectral resonances. 

640 It is worth noting that resonance mis- assignment is a problem for all methods, 

641 (including manual assignment by an expert; it is unavoidable for binning methods 

642 when peaks overlap) and our approach suggests two methods for resolving mistakes. 
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643 Firstly, signature templates corresponding to the compounds generating confounding 

644 signals can be added to the parametric component of the model (providing they are 

645 available). Secondly, the prior on the chemical shift parameter can be adjusted to fix 

646 the position of a misaligned multiplet. 

647 Our approach yields improved concentration estimates. A comparison with a 

648 method for quantifying metabolites based on numerical integration shows the poste- 

649 rior mean estimates of the Bayesian model to be 6 fold more accurate in quadratic 

650 error, even when exact multiplet locations are given to the numerical integration 

651 algorithm. 

652 We are able to fit M = 20 metabolites to an n = 3 000 datapoint spectrum in 

653 about 60 minutes on a 2.2GHz desktop machine using less than 0.5GB of RAM when 

654 running the MCMC algorithm for 10 000 iterations. (If multiple chains are required 

655 to improve mixing, they will run in parallel on a multicore machine.) The number 

656 of operations required by the MCMC algorithm is linear in n and M. The memory 

657 requirement is linear in the number of MCMC iterations and quadratic in n. This rate 

658 of computation can easily compete with the rate of acquisition of spectra by a typical 

659 NMR laboratory, effectively removing a major bottleneck in laboratory pipelines. 

660 An accurate, automatic method for estimating metabolite concentrations from 

661 ^H-NMR spectra will assist many research projects in metabolomics. The field relies 

662 heavily on NMR for metabolite quantification and currently, even projects analyz- 

663 ing a few tens of spectra use numerical integration for estimation. Bayesian model- 

664 ing should become increasingly useful as prior information on metabolite resonance 

665 patterns becomes more accurate and extensive. For example, with more detailed 

666 information our template model could be extended to deal systematically with the 

667 effects of interactions between multiplets. We plan to develop our model further and 

668 to release an efficient implementation of our methodology capable of simultaneously 
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669 deconvolving the majority of metabolites assignable in the NMR spectra of complex 

670 biological mixturesO 
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Figure 1 : A section from an NMR spectrum from an experiment investigating protein 
expression in yeast. The x-axis measures chemical shift in parts per miUion (ppm). 
The y-axis measures relative resonance intensity. 



33 



— I 1 1 1 1 1 — 

3.4 3.2 3.0 2.8 2.6 2.4 



o 







T3 

N 

T3 
i_ 

CO 

■o 

CO 
^— « 

CO 



5 (ppm) 




5 (ppm) 




5 (ppm) 



o 
o 




— I 1 — 

2.8 2.6 



3.4 3.2 



3.0 



5 (ppm) 



2.4 



o ■ 1 1 1 1 1 r 

3.4 3.2 3.0 2.8 2.6 2.4 



5 (ppm) 



Figure 2: An NMR spectrum (top panel) with the principal resonance signals 
deconvolved into the metabolite NMR signatures (lower panels) of (in descending 
order by panel) citric acid, 2-oxoglutaric acid, taurine and trimethylamine N-oxide. 
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Figure 3: Positional noise between, and peak overlap within, two NMR spectra taken 
from the yeast experiment; resonances are generated by alanine {left) and threonine 
(right). 
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Figure 4: The peak configurations of some common types of multiplet. Top row: 
a doublet, with chemical shift S'^^ and peak offset Cmuv Middle row: (from left to 
right) a singlet, a triplet, a quadruplet. Bottom row: (from left to right) a quintuplet, 
a doublet of doublets and a triplet of doublets. 



36 





Figure 5: The effect of a prior penalizing the ^ component of the hkehhood in the lower 
half plane {top) compared to one without this penalization (bottom). The dashed lines 
show the spectral data. Deconvolution of a parametric metabolite signature template 
(heavy lines, left) can be more accurate when the wavelet component (heavy lines, 
right) is penalized below the 5-axis. 
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Figure 6: Normalized histogram of simulated minus Bayesian posterior mean esti- 
mates of multiplet shift. 
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Simulated Concentraaon 



Simulated Concentration 



Figure 7: Concentration estimates from simulated spectra: true vs estimated con- 
centrations for numerical integration {left) and for Bayesian posterior mean {right). 
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Manual estimates of spectroscopists 



Figure 9: Comparison of metabolite concentration estimates by posterior mean of 
the quantification parameters from the Bayesian model (y-axis) with manual spec- 
troscopists estimates (x-axis). The scales are calibrated so that the concentration of 
Trehalose is equal to 1. Each circle represents the mean of the five spectroscopists 
estimates while the limits of the horizontal bars represent the range. The inset is a 
magnification of the region bounded by the dashed line. 
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Table 1: Peak assignment and concentration estimation errors for different magni- 
tudes of simulated chemical shift perturbation. 



Distribution of Simu- 
lated Perturbations 


Peaks Assigned 
Within 0.002ppm 


Mean Quadratic 
Error for Bayes 
Estimate 


Mean Quadratic 
Error for Numer- 
ical Integration 
Estimate 


O.Olppm, O.Olppm] 


83% 


0.016 


0.141 


C/[-0.02ppm, 0.02ppm] 


84% 


0.016 


0.102 


C/[-0.04ppm, 0.04ppm] 


83% 


0.017 


0.135 
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